In this project, we will do exploratory data analysis on a bank’s data set, which was obtained from https://archive.ics.uci.edu/ml/datasets/bank+marketing, and try to gain insights regarding the distribution of the dataset through the implementation of data visualization.
In this section, we will import the necessary libraries for this project.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ stringr 1.4.0
## ✓ tidyr 1.1.2 ✓ forcats 0.5.0
## ✓ readr 1.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##
## Attaching package: 'glue'
## The following object is masked from 'package:dplyr':
##
## collapse
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Here, we will read the csv data file into our IDE, do further inspection regarding our data, and perform suitable data cleansing for further processing.
Here we will perform an integral part of preparing our data into an understandable and complete format for the next step of machine learning.
## Rows: 4,521
## Columns: 17
## $ age <int> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31,…
## $ job <chr> "unemployed", "services", "management", "management", "blue…
## $ marital <chr> "married", "married", "single", "married", "married", "sing…
## $ education <chr> "primary", "secondary", "tertiary", "tertiary", "secondary"…
## $ default <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no",…
## $ balance <int> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 2…
## $ housing <chr> "no", "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes"…
## $ loan <chr> "no", "yes", "no", "yes", "no", "no", "no", "no", "no", "ye…
## $ contact <chr> "cellular", "cellular", "cellular", "unknown", "unknown", "…
## $ day <int> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, 29…
## $ month <chr> "oct", "may", "apr", "jun", "may", "feb", "may", "may", "ma…
## $ duration <int> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113, 3…
## $ campaign <int> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, 1,…
## $ pdays <int> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, -1…
## $ previous <int> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1,…
## $ poutcome <chr> "unknown", "failure", "failure", "unknown", "unknown", "fai…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no",…
Input variables:
## age job marital education
## Min. :19.00 Length:4521 Length:4521 Length:4521
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :41.17
## 3rd Qu.:49.00
## Max. :87.00
## default balance housing loan
## Length:4521 Min. :-3313 Length:4521 Length:4521
## Class :character 1st Qu.: 69 Class :character Class :character
## Mode :character Median : 444 Mode :character Mode :character
## Mean : 1423
## 3rd Qu.: 1480
## Max. :71188
## contact day month duration
## Length:4521 Min. : 1.00 Length:4521 Min. : 4
## Class :character 1st Qu.: 9.00 Class :character 1st Qu.: 104
## Mode :character Median :16.00 Mode :character Median : 185
## Mean :15.92 Mean : 264
## 3rd Qu.:21.00 3rd Qu.: 329
## Max. :31.00 Max. :3025
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.00 Min. : 0.0000 Length:4521
## 1st Qu.: 1.000 1st Qu.: -1.00 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.00 Median : 0.0000 Mode :character
## Mean : 2.794 Mean : 39.77 Mean : 0.5426
## 3rd Qu.: 3.000 3rd Qu.: -1.00 3rd Qu.: 0.0000
## Max. :50.000 Max. :871.00 Max. :25.0000
## y
## Length:4521
## Class :character
## Mode :character
##
##
##
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
In this section of the report, we will make necessary adjustments to the data set in order to make it feasible for further processing and data exploration.
As can be seen from the data description above which was provided by the source, the columns c(“job”, “marital”, “education”, “default”, “housing”, “loan”, “contact”, “month”, “poutcome”, “y”) should have had categorical values. However, in the above inspection by using the glimpse function, it can be seen that they are still character data types. Hence, below they are transformed into the form of factors.
bank <- bank %>%
mutate_at(c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome", "y"), as.factor)
str(bank)## 'data.frame': 4521 obs. of 17 variables:
## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Below, we iterate through each columns to check the levels of each categorical variables for every factor data types in our data set.
for (column in names(bank)) {
if (is.factor(bank[,column])) {
print(paste(column, ":", sep = " "))
print(levels(bank[,column]))
}
}## [1] "job :"
## [1] "admin." "blue-collar" "entrepreneur" "housemaid"
## [5] "management" "retired" "self-employed" "services"
## [9] "student" "technician" "unemployed" "unknown"
## [1] "marital :"
## [1] "divorced" "married" "single"
## [1] "education :"
## [1] "primary" "secondary" "tertiary" "unknown"
## [1] "default :"
## [1] "no" "yes"
## [1] "housing :"
## [1] "no" "yes"
## [1] "loan :"
## [1] "no" "yes"
## [1] "contact :"
## [1] "cellular" "telephone" "unknown"
## [1] "month :"
## [1] "apr" "aug" "dec" "feb" "jan" "jul" "jun" "mar" "may" "nov" "oct" "sep"
## [1] "poutcome :"
## [1] "failure" "other" "success" "unknown"
## [1] "y :"
## [1] "no" "yes"
In this section, we will be performing concepts of feature engineering and data visualization in effort to successfully demonstrate exploratory data analysis concepts towards the audience.
Here, we select only a small group of columns that will be needed to produce the boxplot of job - balance distribution across our dataframe. Here, we perform a label encoding to distinguish the different jobs using a new variable named job_id to enhance the aesthetics in our plotting by preventing colliding jobs in the x-axis of the label.
balance_job_agg <- bank %>%
mutate(job_id = factor(as.numeric(bank$job))) %>%
select(c("balance", "job", "job_id"))
balance_job_agg %>% head()With the code chunk below, we subset the necessary columns with some adjustments for plotting job_id against number of acceptance of the product.
yes_job_agg <- bank %>%
mutate(job_id = factor(as.numeric(bank$job))) %>%
filter(y == "yes") %>%
select(c("job", "job_id", "y")) %>%
group_by(job_id) %>%
count(y) %>%
mutate(tooltip = glue("Total Acceptance: {(n)}"))
yes_job_agg %>% head()no_job_agg <- bank %>%
mutate(job_id = factor(as.numeric(bank$job))) %>%
filter(y == "no") %>%
select(c("job", "job_id", "y")) %>%
group_by(job_id) %>%
count(y) %>%
mutate(tooltip = glue("Total Rejections: {(n)}"))
no_job_agg %>% head()yes_job_agg$ratio <- yes_job_agg$n/no_job_agg$n
acceptance_rate <- bank %>%
mutate(job_id = factor(as.numeric(bank$job))) %>%
select(c("job", "job_id")) %>%
group_by(job_id) %>%
inner_join(yes_job_agg) %>%
select(c("job_id", "job", "ratio")) %>%
mutate(tooltip = glue("Ratio: {(ratio)}%")) %>%
arrange(job_id)## Joining, by = "job_id"
yes_count <- bank %>%
group_by(marital) %>%
filter(y=="yes") %>%
count(y)
no_count <- bank %>%
group_by(marital) %>%
filter(y=="no") %>%
count(y)
marital_y_agg <- rbind(yes_count, no_count)
marital_y_aggThe following is the label encoding that will be used in visualizing the distributions involving job_id.
A visualization using boxplot towards the distribution of balance among different jobs. From the below interpretation, we can notice the presence of many outliers that might skew the distribution upwards. However, there are also some extreme negative outliers that can be found in our dataframe, namely in the entrepreneur and self-employed sectors.
ggplot(balance_job_agg, aes(x = job_id, y = balance)) +
geom_boxplot(aes(fill = balance)) +
labs(title = "Balance According to Job", x= "job_id", y= "Balance",
subtitle = "Avg. price indicated by the dotted red line") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_hline(yintercept = mean(balance_job_agg$balance), color = "red", linetype = 5)A plotting using geom_col to view the product acceptance distribution among different kind of jobs. It seems that the product is most well-perceived by the people in management field as reflected on the graph.
p <- ggplot(yes_job_agg, aes(x = job_id, y= n, fill = n, text = tooltip)) +
geom_col(position = "identity") +
scale_fill_gradient(low = "#e4333e", high = "#52171a") +
labs(title = "Product Acceptance Distribution Among Jobs", x ="job_id", y= "amount")+
theme_minimal()
ggplotly(p, tooltip = c('text'))The following is the product rejection distribution among different jobs as opposed to the above shown graph.
p <- ggplot(no_job_agg, aes(x = job_id, y= n, fill = n, text = tooltip)) +
geom_col(position = "identity") +
scale_fill_gradient(low = "#69eeee", high = "#157373") +
labs(title = "Product Rejection Distribution Among Jobs", x ="job_id", y= "amount")+
theme_minimal()
ggplotly(p, tooltip = c('text'))acc_p <- ggplot(acceptance_rate, aes(x = job_id, y = ratio, fill = ratio, text = tooltip)) +
geom_col() +
facet_grid( scales = "free_y")+
geom_point(aes(col=ratio))+
labs(title = "Acceptance Rate Among Each Job", x="job_id", y= "Ratio")+
coord_flip()
ggplotly(acc_p, tooltip = c("text"))The following graph attempts to illustrate the same product acceptance distributions on various marital status.
ggplot(marital_y_agg,aes(x = marital,y = n))+
geom_col(aes(fill = y), position = "dodge") +
labs (title = "Product Acceptance Distribution Among Marital Groups", x = "Marital Status", y = "Amount of Acceptance", fill ="Acceptance")+
theme(plot.title = element_text(hjust = 0.5))Based on the above visualizations, it seems that targeting single, retired persons might be the most beneficial for the company in terms of obtaining customers for their product. The chart which shows that people in management position being the category with the most acceptance towards the bank’s product can be inferred to be skewed due to the high amount of population within the sector as we can see the acceptance ratio is only about 1.6.